A Unified Optimization Approach for Sparse Tensor Operations on GPUs
Sparse tensors appear in many large-scale applications with multidimensional
and sparse data. While multidimensional sparse data often need to be processed
on manycore processors, attempts to develop highly-optimized GPU-based
implementations of sparse tensor operations are rare. The irregular computation
patterns and sparsity structures as well as the large memory footprints of
sparse tensor operations make such implementations challenging. We leverage the
fact that sparse tensor operations share similar computation patterns to
propose a unified tensor representation called F-COO. Combined with
GPU-specific optimizations, F-COO provides highly-optimized implementations of
sparse tensor computations on GPUs. The performance of the proposed unified
approach is demonstrated for tensor-based kernels such as the Sparse Matricized
Tensor-Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor-Times-Matrix
Multiply (SpTTM), and is used in tensor decomposition algorithms. Compared to
state-of-the-art work, we improve the performance of SpTTM and SpMTTKRP by up
to 3.7 and 30.6 times respectively on NVIDIA Titan-X GPUs. We implement a
CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using
the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs.
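The shared computation pattern the unified approach exploits can be illustrated with a mode-1 MTTKRP over a plain COO-format sparse tensor in pure Python. F-COO's GPU-specific additions are not detailed in the abstract, so this is only a minimal sketch of the common pattern, with illustrative names:

```python
# Minimal sketch: mode-1 MTTKRP over a COO sparse tensor (illustrative,
# not the F-COO format itself). For each stored nonzero X[i,j,k]:
#   M[i][r] += X[i,j,k] * B[j][r] * C[k][r]

def mttkrp_coo(shape, coords, vals, B, C, rank):
    """coords: list of (i, j, k) indices; vals: the matching nonzeros.
    B, C: factor matrices for modes 2 and 3. Returns the I x rank result."""
    I = shape[0]
    M = [[0.0] * rank for _ in range(I)]
    for (i, j, k), v in zip(coords, vals):
        for r in range(rank):
            M[i][r] += v * B[j][r] * C[k][r]
    return M
```

SpTTM follows the same nonzero-driven loop with a different per-nonzero update, which is what makes a single unified representation practical.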
Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis
Sympiler is a domain-specific code generator that optimizes sparse matrix
computations by decoupling the symbolic analysis phase from the numerical
manipulation stage in sparse codes. The computation patterns in sparse
numerical methods are guided by the input sparsity structure and the sparse
algorithm itself. In many real-world simulations, the sparsity pattern changes
little or not at all. Sympiler takes advantage of these properties to
symbolically analyze sparse codes at compile time and to apply inspector-guided
transformations that enable further low-level transformations of the sparse
code. As a result, the Sympiler-generated code outperforms highly-optimized
matrix factorization codes from commonly-used specialized libraries, obtaining
average speedups of 3.8X over Eigen and 1.5X over CHOLMOD.
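The decoupling Sympiler performs can be illustrated on a sparse lower-triangular solve Lx = b: a symbolic phase computes, once per sparsity pattern, the set of columns the solve must visit, and a cheaper numeric phase reuses that set for every new right-hand side. This is a hand-written pure-Python sketch of the idea, not Sympiler's generated code:

```python
# Sketch of symbolic/numeric decoupling for Lx = b (L lower triangular,
# stored column-wise as L[j] = {row: value}). Names are illustrative.

def symbolic_reach(L_pattern, b_nonzeros):
    """Symbolic phase: DFS from the nonzeros of b over the column
    dependency graph of L; returns the columns to visit, dependencies
    first. Runs once per sparsity pattern."""
    visited, order = set(), []
    def dfs(j):
        visited.add(j)
        for i in L_pattern[j]:          # rows below the diagonal in column j
            if i not in visited:
                dfs(i)
        order.append(j)
    for j in b_nonzeros:
        if j not in visited:
            dfs(j)
    order.reverse()
    return order

def numeric_lsolve(L, order, b):
    """Numeric phase: forward solve visiting only the precomputed columns.
    b is a sparse right-hand side given as {index: value}."""
    x = dict(b)
    for j in order:
        x[j] = x.get(j, 0.0) / L[j][j]
        for i, lij in L[j].items():
            if i > j:
                x[i] = x.get(i, 0.0) - lij * x[j]
    return x
```

When the pattern is fixed across a simulation, the symbolic cost is amortized over many numeric solves, which is the property the abstract describes.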
A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
Machine Learning (ML) models contain highly-parallel computations, such as
matrix multiplication, convolutions, and dropout. These computations are
commonly executed on Graphics Processing Units (GPUs) by dividing the
computation into independent processing blocks, known as tiles. Since the
number of tiles is usually higher than the number of execution units of a GPU,
tiles are executed on all execution units in waves. However, the tiles executed
in the last wave can under-utilize the execution units because the number of
tiles is not always a multiple of the number of execution units. This
under-utilization can be reduced by executing multiple independent kernels
concurrently on a GPU, but this is not currently possible for dependent
kernels.
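The wave arithmetic behind this under-utilization is simple enough to state directly; a minimal sketch (illustrative names, not part of cuSync):

```python
# Wave count and last-wave utilization for a tiled GPU kernel:
# `tiles` processing blocks scheduled onto `units` execution units.
import math

def wave_stats(tiles, units):
    waves = math.ceil(tiles / units)
    last_wave_tiles = tiles - (waves - 1) * units
    return waves, last_wave_tiles / units   # (wave count, last-wave utilization)
```

For example, 10 tiles on 4 execution units run in 3 waves, and the last wave occupies only 2 of the 4 units, i.e. 50% utilization.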
In this paper, we present cuSync, a framework for writing custom fine-grained
synchronization policies for dependent kernels to improve GPU utilization.
cuSync synchronizes tiles instead of kernels, which allows tiles of multiple
dependent kernels to execute concurrently. Using cuSync, we expressed several
synchronization policies in a few lines of code and reduced the inference times
of GPT-3 and ResNet-38 by up to 1.19x and 1.16x, respectively.
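The idea of synchronizing at tile rather than kernel granularity can be simulated with ordinary threads: each consumer tile waits only on the one producer tile it reads, instead of on the whole producer kernel. This is a hedged pure-Python analogy, not cuSync's CUDA API, and all names are illustrative:

```python
# Tile-granular synchronization, simulated with threads: per-tile flags
# (threading.Event) let consumer tiles start as soon as their producer
# tile is done, without waiting for the full producer "kernel".
import threading

def run_dependent_kernels(n_tiles):
    done = [threading.Event() for _ in range(n_tiles)]
    out = [0] * n_tiles                  # producer tile outputs
    result = [0] * n_tiles               # consumer tile outputs

    def producer_tile(t):
        out[t] = t * t                   # stand-in for the tile's computation
        done[t].set()                    # publish: tile t is ready

    def consumer_tile(t):
        done[t].wait()                   # fine-grained wait on one tile only
        result[t] = out[t] + 1

    threads = [threading.Thread(target=consumer_tile, args=(t,))
               for t in range(n_tiles)]
    threads += [threading.Thread(target=producer_tile, args=(t,))
                for t in range(n_tiles)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result
```

On a real GPU the flags live in device memory and the policy decides which producer tiles a consumer tile must wait on; the thread version above only captures the dependency structure.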
Characterizing and Enhancing SMT Clustered Architectures
Krylov subspace techniques on graphic processing units
Computations related to many scientific and engineering problems spend most of their time solving large, sparse linear systems. Improving the performance of these solvers on modern parallel architectures enables scientists to simulate large, accurate models and manipulate massive amounts of data in reasonable time frames. Krylov subspace methods (KSMs) are iterative techniques used to solve large sparse systems. The main time-consuming kernels in KSMs are sparse matrix-vector multiplication (SpMV), vector operations (dot products and vector sums), and preconditioner manipulation. This work presents techniques and algorithms to accelerate some of these kernels on a recent generation of parallel architecture called manycore processors. The performance of the proposed optimizations is tested on graphics processing units (GPUs) and compared to previous work.

The SpMV kernel is accelerated on GPUs, and speedups of up to 3.3 times are achieved compared to previous GPU implementations of the algorithm. The conjugate gradient iterative solver is accelerated on NVIDIA graphics cards, and a 12.9-fold speedup is achieved compared to an optimized implementation of the kernel on multicore CPUs. The sparse approximate inverse preconditioner is accelerated on GPUs and used to enhance the convergence rate of the BiCGStab iterative solver. The preconditioner is generated on an NVIDIA GTX480 in the same time it takes 16 AMD Opteron 252 processors to generate the same preconditioner.

Communicating data between levels of a memory hierarchy and between processors is time consuming and costly in KSMs. Communication-avoiding (CA) Krylov solvers take k steps of a KSM for the same communication cost as one step, reducing the communication overhead of standard KSMs.
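The two kernels named above, SpMV and the conjugate gradient (CG) loop built on it, can be sketched in a few lines of pure Python; the GPU versions differ only in how these kernels are parallelized. This is a minimal reference sketch, not the thesis's implementation:

```python
# Reference sketch: COO SpMV and the CG iteration for a symmetric
# positive definite A given as parallel arrays (rows, cols, vals).

def spmv_coo(rows, cols, vals, x, n):
    y = [0.0] * n
    for i, j, v in zip(rows, cols, vals):
        y[i] += v * x[j]
    return y

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cg(rows, cols, vals, b, n, tol=1e-10, max_iter=100):
    """Solve Ax = b; one SpMV, two dot products, and three vector
    updates per iteration -- the kernels the abstract accelerates."""
    x = [0.0] * n
    r = list(b)                          # residual: b - A*0
    p = list(b)
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = spmv_coo(rows, cols, vals, p, n)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

On a GPU, the SpMV and the vector updates map naturally to thread-parallel loops, while the dot products become reductions.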
The matrix powers kernel in communication-avoiding Krylov solvers is accelerated on NVIDIA GPUs, and speedups of up to 5.7 times are achieved for the tested problems compared to the standard implementation of k SpMV kernels.
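The baseline the matrix powers kernel is compared against is simply k back-to-back SpMVs producing the monomial Krylov basis [x, Ax, A²x, ..., Aᵏx]; a minimal pure-Python sketch of that baseline:

```python
# Baseline "matrix powers" computation: k repeated COO SpMVs building
# the Krylov basis [x, Ax, A^2 x, ..., A^k x]. The CA-optimized GPU
# kernel computes the same basis with less data movement.

def matrix_powers(rows, cols, vals, x, n, k):
    basis = [list(x)]
    for _ in range(k):
        y = [0.0] * n
        for i, j, v in zip(rows, cols, vals):
            y[i] += v * basis[-1][j]     # one SpMV per basis vector
        basis.append(y)
    return basis
```

The communication-avoiding version restructures this loop so the matrix (or the needed partitions of it) is read far fewer times, which is where the reported 5.7x speedup comes from.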